NSF PAR Search | NSF Public Access Repository

Annotating publicly-available samples and studies using interpretable modeling of unstructured metadata

https://doi.org/10.1093/bib/bbae652

Yuan, Hao; Hicks, Parker; Ahmadian, Mansooreh; Johnson, Kayla A; Valtadoros, Lydia; Krishnan, Arjun (November 2024, Briefings in Bioinformatics)

Abstract Reusing massive collections of publicly available biomedical data can significantly impact knowledge discovery. However, these public samples and studies are typically described using unstructured plain text, hindering the findability and further reuse of the data. To combat this problem, we propose txt2onto 2.0, a general-purpose method based on natural language processing and machine learning for annotating biomedical unstructured metadata to controlled vocabularies of diseases and tissues. Compared to the previous version (txt2onto 1.0), which uses numerical embeddings as features, this new version uses words as features, resulting in improved interpretability and performance, especially when few positive training instances are available. Txt2onto 2.0 uses embeddings from a large language model during prediction to deal with unseen-yet-relevant words related to each disease and tissue term being predicted from the input text, thereby explaining the basis of every annotation. We demonstrate the generalizability of txt2onto 2.0 by accurately predicting disease annotations for studies from independent datasets, using proteomics and clinical trials as examples. Overall, our approach can annotate biomedical text regardless of experimental types or sources. Code, data, and trained models are available at https://github.com/krishnanlab/txt2onto2.0.
Full Text Available
Integration of 168,000 samples reveals global patterns of the human gut microbiome

https://doi.org/10.1016/j.cell.2024.12.017

Abdill, Richard J; Graham, Samantha P; Rubinetti, Vincent; Ahmadian, Mansooreh; Hicks, Parker; Chetty, Ashwin; McDonald, Daniel; Ferretti, Pamela; Gibbons, Elizabeth; Rossi, Marco; et al (February 2025, Cell)

Full Text Available
Geodesic length and shifted weights in first-passage percolation

https://doi.org/10.1090/cams/18

Krishnan, Arjun; Rassoul-Agha, Firas; Seppäläinen, Timo (May 2023, Communications of the American Mathematical Society)

We study first-passage percolation through related optimization problems over paths of restricted length. The path length variable is in duality with a shift of the weights. This puts into a convex duality framework old observations about the convergence of the normalized Euclidean length of geodesics due to Hammersley and Welsh, Smythe and Wierman, and Kesten, and leads to new results about geodesic length and the regularity of the shape function as a function of the weight shift. For points far enough away from the origin, the ratio of the geodesic length and the $\ell^{1}$ distance to the endpoint is uniformly bounded away from one. The shape function is a strictly concave function of the weight shift. Atoms of the weight distribution generate singularities, that is, points of nondifferentiability, in this function. We generalize to all distributions, directions and dimensions an old singularity result of Steele and Zhang for the planar Bernoulli case. When the weight distribution has two or more atoms, a dense set of shifts produces singularities. The results come from a combination of the convex duality, the shape theorems of the different first-passage optimization problems, and modification arguments.
Full Text Available
Accurately modeling biased random walks on weighted networks using node2vec+

https://doi.org/10.1093/bioinformatics/btad047

Liu, Renming; Hirn, Matthew; Krishnan, Arjun (January 2023, Bioinformatics)
Martelli, Pier Luigi (Ed.)
Abstract Motivation: Accurately representing biological networks in a low-dimensional space, also known as network embedding, is a critical step in network-based machine learning and is carried out widely using node2vec, an unsupervised method based on biased random walks. However, while many networks, including functional gene interaction networks, are dense, weighted graphs, node2vec is fundamentally limited in its ability to use edge weights during the biased random walk generation process, thus under-using all the information in the network. Results: Here, we present node2vec+, a natural extension of node2vec that accounts for edge weights when calculating walk biases and reduces to node2vec in the cases of unweighted graphs or unbiased walks. Using two synthetic datasets, we empirically show that node2vec+ is more robust to additive noise than node2vec in weighted graphs. Then, using genome-scale functional gene networks to solve a wide range of gene function and disease prediction tasks, we demonstrate the superior performance of node2vec+ over node2vec in the case of weighted graphs. Notably, due to the limited amount of training data in the gene classification tasks, graph neural networks such as GCN and GraphSAGE are outperformed by both node2vec and node2vec+. Availability and implementation: The data and code are available on GitHub at https://github.com/krishnanlab/node2vecplus_benchmarks. All additional data underlying this article are available on Zenodo at https://doi.org/10.5281/zenodo.7007164. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
Expression‐based machine learning models for predicting plant tissue identity

https://doi.org/10.1002/aps3.11621

Palande, Sourabh; Arsenault, Jeremy; Basurto‐Lozada, Patricia; Bleich, Andrew; Brown, Brianna N I; Buysse, Sophia F; Connors, Noelle A; Das Adhikari, Sikta; Dobson, Kara C; Guerra‐Castillo, Francisco Xavier; et al (January 2025, Applications in Plant Sciences)

Abstract Premise: The selection of Arabidopsis as a model organism played a pivotal role in advancing genomic science. The competing frameworks to select an agricultural‐ or ecological‐based model species were rejected, in favor of building knowledge in a species that would facilitate genome‐enabled research. Methods: Here, we examine the ability of models based on Arabidopsis gene expression data to predict tissue identity in other flowering plants. Comparing different machine learning algorithms, models trained and tested on Arabidopsis data achieved near perfect precision and recall values, whereas when tissue identity is predicted across the flowering plants using models trained on Arabidopsis data, precision values range from 0.69 to 0.74 and recall from 0.54 to 0.64. Results: The identity of belowground tissue can be predicted more accurately than other tissue types, and the ability to predict tissue identity is not correlated with phylogenetic distance from Arabidopsis. k‐nearest neighbors is the most successful algorithm, suggesting that gene expression signatures, rather than marker genes, are more valuable to create models for tissue and cell type prediction in plants. Discussion: Our data‐driven results highlight that the assertion that knowledge from Arabidopsis is translatable to other plants is not always true. Considering the current landscape of abundant sequencing data, we should reevaluate the scientific emphasis on Arabidopsis and prioritize plant diversity.
Full Text Available
Topological data analysis reveals a core gene expression backbone that defines form and function across flowering plants

https://doi.org/10.1371/journal.pbio.3002397

Palande, Sourabh; Kaste, Joshua A M; Roberts, Miles D; Segura Abá, Kenia; Claucherty, Carly; Dacon, Jamell; Doko, Rei; Jayakody, Thilani B; Jeffery, Hannah R; Kelly, Nathan; et al (December 2023, PLOS Biology)
Drost, Hajk-Georg (Ed.)
Since they emerged approximately 125 million years ago, flowering plants have evolved to dominate the terrestrial landscape and survive in the most inhospitable environments on earth. At their core, these adaptations have been shaped by changes in numerous, interconnected pathways and genes that collectively give rise to emergent biological phenomena. Linking gene expression to morphological outcomes remains a grand challenge in biology, and new approaches are needed to begin to address this gap. Here, we implemented topological data analysis (TDA) to summarize the high dimensionality and noisiness of gene expression data using lens functions that delineate plant tissue and stress responses. Using this framework, we created a topological representation of the shape of gene expression across plant evolution, development, and environment for the phylogenetically diverse flowering plants. The TDA-based Mapper graphs form a well-defined gradient of tissues from leaves to seeds, or from healthy to stressed samples, depending on the lens function. This suggests that there are distinct and conserved expression patterns across angiosperms that delineate different tissue types or responses to biotic and abiotic stresses. Genes that correlate with the tissue lens function are enriched in central processes such as photosynthetic, growth and development, housekeeping, or stress responses. Together, our results highlight the power of TDA for analyzing complex biological data and reveal a core expression backbone that defines plant form and function.
Full Text Available
Current and future directions in network biology

https://doi.org/10.1093/bioadv/vbae099

Zitnik, Marinka; Li, Michelle M; Wells, Aydin; Glass, Kimberly; Morselli Gysi, Deisy; Krishnan, Arjun; Murali, T M; Radivojac, Predrag; Roy, Sushmita; Baudot, Anaïs; et al (January 2024, Bioinformatics Advances)
Lengauer, Thomas (Ed.)
Abstract Summary: Network biology is an interdisciplinary field bridging computational and biological sciences that has proved pivotal in advancing the understanding of cellular functions and diseases across biological systems and scales. Although the field has been around for two decades, it remains nascent. It has witnessed rapid evolution, accompanied by emerging challenges. These stem from various factors, notably the growing complexity and volume of data together with the increased diversity of data types describing different tiers of biological organization. We discuss prevailing research directions in network biology, focusing on molecular/cellular networks but also on other biological network types such as biomedical knowledge graphs, patient similarity networks, brain networks, and social/contact networks relevant to disease spread. In more detail, we highlight areas of inference and comparison of biological networks, multimodal data integration and heterogeneous networks, higher-order network analysis, machine learning on networks, and network-based personalized medicine. Following the overview of recent breakthroughs across these five areas, we offer a perspective on future directions of network biology. Additionally, we discuss scientific communities, educational initiatives, and the importance of fostering diversity within the field. This article establishes a roadmap for an immediate and long-term vision for network biology. Availability and implementation: Not applicable.
Full Text Available
RECoN: Rice Environment Coexpression Network for Systems Level Analysis of Abiotic-Stress Response

https://doi.org/10.3389/fpls.2017.01640

Krishnan, Arjun; Gupta, Chirag; Ambavaram, Madana M.; Pereira, Andy (September 2017, Frontiers in Plant Science)

Full Text Available
